1 Introduction



An increasing number of research and development efforts have recently
positioned themselves under the banner Language Engineering (LE). This
signals a shift away from well-established labels such as Natural Language
Processing (NLP) and Computational Linguistics. Examples include the renaming
of UMIST's^1 Department of Language and Linguistics (location of the Centre
for Computational Linguistics) as the Department of Language Engineering,
and the naming of the European Commission's current relevant funding
programme Language Engineering (the previous programme was called Linguistic
Research and Engineering). The new journal of Natural Language Engineering
is another example^2.

^1 The University of Manchester Institute of Science and Technology,
Manchester, UK.

We shall argue here that this shift is more than simple TLA^3-fatigue. The new
name reflects a change of emphasis within the field towards:

• increasing use of quantitative evaluation as a metric of research achieve- 
ment; 

• renewed interest in statistical language models and 
automatically-generated resources; 

• increasing availability and use of large-scale resources (e.g. corpora, 
machine-readable dictionaries); 

• a re-orientation of language processing research to large-scale
applications, with a concomitant emphasis on predictability and conformance
to requirements specifications (i.e. emphasis on engineering issues).

Section 2 expands on these points, and the rest of the report then argues that
this shift requires a more general approach to LE research and development,
centred on the provision of support software in the form of a general
architecture and development environment specifically designed for text
processing systems. Under EPSRC^4 grant GR/K25267 the NLP group at the
University of Sheffield are developing a system that aims to implement this
new approach (Wilks, Gaizauskas 1994). The system is called GATE - the General
Architecture for Text Engineering.

GATE is an architecture in the sense that it provides a common infrastructure
for building LE systems (rather like the frame of a building or the interface
specifications for the bus and peripherals of a computer). It is also a
development environment that provides aids for the construction, testing and
evaluation of LE systems (and particularly for the reuse of existing
components in new systems).

^2 The editorial of the first issue also discusses the new name (Boguraev,
Garigliano, Tait 1995).

^3 TLA: three-letter acronym.

^4 The Engineering and Physical Sciences Research Council, a UK funding body.

Section 3 describes the architecture. Section ^ describes an initial application
of GATE to collaborative research in Information Extraction (IE). Appendix 
discusses three existing systems: 

• ALEP (Simpkins 1992), which turns out to be a rather different enter- 
prise from ours; 

• MULTEXT (Thompson 1995a; Ballim 1995; Finch, Thompson, McK- 
elvie 1995), a different but largely complementary approach to some of 
the problems addressed by GATE, which is particularly strong on SGML 
support and elements of which we intend to integrate with GATE; 

• TIPSTER (ARPA 1993a), whose architecture (Grishman 1995) has been 
adopted as the storage substructure of GATE, and which has been a 
primary influence in the design and implementation of our system. 

Appendix ^ is a preliminary design and implementation document for GATE. 

2 Current trends in Language Engineering R&D 

We noted at the outset a recent trend towards re-positioning language pro- 
cessing R&D as Language Engineering. This should not be taken to imply or 
require the death of [Computational] Linguistics! The shift is quite possibly
one from theory to practice. This section examines the background and
consequences of the trend. 

Packing up the toys



Several commentators have characterised the broad trend of AI approaches to
language as tending towards the "toy problem syndrome", expressing the view
that AI has too often chosen to investigate artificial, small-scale
applications of the technology under development. These "toy" problems are
intended to be representative of the work involved in building applications of
the technology for end-user or "real-world" tasks, but scaling up problem
domains from the toy to the useful has often shown the technology developed
for the toy to be unsuitable for the real job.

For example, one of the present authors began a large-scale Prolog grammar 
project in 1985 (Farwell, Wilks 1989): by 1987 it was perhaps the largest 
DCG (Definite Clause Grammar) grammar anywhere, designed to cover a 
linguistically well-motivated test set of sentences in English. Interpreted by 
a standard parser it was able to parse completely and uniquely virtually no 
sentence chosen randomly from a newspaper. We suspect most large grammars 
of that type and era did no better, though reports are seldom written making 
this point. 

The mystery for linguists is how that can be: the grammar appeared on
inspection to be virtually complete - it had to cover English, if thirty years
of linguistic intuition and methodology had any value. It is a measure of the
total lack of evaluation of parsing projects up to that time that such
conflicts of result and intuition were possible, a situation virtually
unchanged since Kuno's large-scale Harvard parser of the 1960s (Kuno,
Oettinger 1962), whose similar failure to produce a single, preferred,
spanning parse gave rise to the AI semantics and knowledge-based movement. The
situation was effectively unchanged in 1985, but the response this time around
has been quite different, characterised by:

• use of empirical methods with strict evaluation criteria; 

• renewed interest in performance-based models of language, and a corre- 
sponding renewal and extension of statistical techniques in the area; 

• increased provision and reuse of large-scale data resources; 

• greater emphasis on the development of prototype applications of NLP 
technology to large-scale problems. 

Measuring results with numbers 

With hindsight it may seem obvious that computational linguistics, in the sense 
of computer programs that seek to exploit the results of linguistic research to 
make computers do useful things with human language^5, should be subject to
empirical criteria of effectiveness. The big problem, of course, is determining 
precisely what the criteria of success should be. Should we collect video tapes 
of Star Trek and measure our efforts in comparison to the Enterprise's lucid 
conversational computer? There is now a substantial literature on this question 
(Crouch, Gaizauskas, Netter 1995; EAGLES 1994; Galliers, Sparck Jones 1993; 
Palmer, Finin 1990; Sparck Jones 1994), and more practical solutions to the 
evaluation problem have emerged in a number of areas. 

Participants in the TIPSTER programme and in the MUC (Message Understanding
Conference, an information extraction competition) and TREC (Text Retrieval
Conference, an information retrieval (or 'document detection') competition)
competitions (ARPA 1992, 1994), for example, build systems to do
precisely-defined tasks on selected bodies of news articles. Human analysts
are employed to produce correct answers for some set of previously unseen
texts, and the systems are run to produce machine output for those texts. The
performance of the systems relative to human annotators is then measurable
quantitatively. Quantitative evaluation metrics bring numerically well-defined
concepts like precision and recall, long used to evaluate information retrieval
systems, to language engineering.

^5 There is, of course, at least one other sense, that of using computational
tools to aid linguistic research.
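
As an illustration of how such scores can be computed, the sketch below
compares a set of system answers with a human-produced answer key. The function
name `score` and the toy data are hypothetical. The F-measure shown is the
balanced harmonic mean of precision and recall, one common way of combining the
two figures, and is not claimed to be the exact scoring procedure of MUC or
TREC.

```python
def score(system, key):
    """Return (precision, recall, f_measure) for two collections of answers.

    precision = correct system answers / all system answers
    recall    = correct system answers / all answers in the human key
    """
    system, key = set(system), set(key)
    correct = len(system & key)
    precision = correct / len(system) if system else 0.0
    recall = correct / len(key) if key else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Hypothetical named-entity answers: (text span, entity type).
key = {("Sheffield", "LOCATION"), ("ARPA", "ORGANIZATION"),
       ("Wall Street Journal", "ORGANIZATION"), ("Kuno", "PERSON")}
system = {("Sheffield", "LOCATION"), ("ARPA", "ORGANIZATION"),
          ("Kuno", "LOCATION")}  # the last answer has the wrong type

precision, recall, f_measure = score(system, key)
```

Here two of the three system answers are correct (precision 2/3) and two of the
four key answers are found (recall 1/2).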

It seems likely that the linking of quantitative performance metrics to funding, 
as is the case in the U.S., has fostered a culture willing to pursue any meth- 
ods that are effective in these terms even where theoretical purity suggests a 
different route. Whether this is a good or a bad thing is left as an exercise for 
the reader. We note, however, that the recent successes of speech recognition 
technology arose in a similar culture (Church, Mercer 1993). 

The U.S. model is not without significant disadvantages, however, principally: 

1. a tendency to exclude novelty as sites all focus on one set of tasks; 

2. the high cost of producing evaluation data and administering competitive 
evaluation. 

In the IE field (1) is evident in the current bias towards template-filling, an 
application designed at the behest of the U.S. intelligence community. 

Regarding (2), analysis of the funding required for a European equivalent to the 
American programmes has led the European Commission to reject comparative 
evaluation (Cencioni 1995). 

We shall argue in section 3 below that both of these problems can be offset
while retaining the benefits of empirical evaluation.

^6 Machine Translation systems had, of course, been subject to rigorous
evaluation from the field's earliest days (Lehrberger, Bourbeau 1988), but
this tradition did not spread further until recently.

Performance vs. competence



A related phenomenon is the increasing use of statistical techniques in the field 
(Jelinek 1985; Church, Mercer 1993). Instead of an introspective process of in- 
vestigation into the underlying mechanisms by which people process language 
(or, in Chomsky's terms, their competence), statistical NLP attempts to build 
models of language as it exists in practical use - the performance of language. 
(Rens Bod's thesis contains an extended discussion of this distinction - Bod
1995.)

Statistical methods have had significant successes, and the debate once thought 
closed by Chomsky's 'I saw a fragile whale' is now as open as it ever has 
been. Most part-of-speech taggers now rely on statistics (Leech, Garside, 
Atwell 1983; Robert, Armstrong 1995) and it seems possible that parsers may 
also go this way (Church 1988; Magerman 1994, 1995; Briscoe, Carroll 1993),
though more conventional methods are also increasing in quality and robust- 
ness (Strzalkowski, Scheyen 1993). 

It is possible that there is a natural ceiling to the advance of performance 
models (Wilks 1994), but the point of relevance for this report is that the 
jury is still out on performance vs. competence. Thus, as well as a host of 
competing linguistic and lexicographic theories, LE is home to a thoroughgoing 
paradigm conflict. Two important consequences ensue.

First, empirical measurement of the relative efficacy of competing techniques 
is even more important. Secondly, hybrid models are becoming common, im- 
plying a growing need for the flexible combination of different techniques in 
single systems. A number of techniques that perform poorly alone may
sometimes be combined to produce a whole greater than the sum of the parts
(Wilks, Guthrie, Guthrie, Cowie 1992; Bartell, Cottrell, Belew 1994).
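
One simple way in which weak techniques can be combined is majority voting. The
sketch below is purely illustrative: the three toy taggers and the functions
`vote` and `accuracy` are hypothetical, and voting is only one of several
possible combination schemes, not the method of the works cited above.

```python
from collections import Counter


def vote(predictions):
    """Combine label sequences from several techniques by majority vote.

    `predictions` is a list of equal-length label sequences, one per technique.
    Ties are broken in favour of the technique listed first.
    """
    combined = []
    for labels in zip(*predictions):
        counts = Counter(labels)
        best = max(counts.values())
        # Take the first label (in technique order) that reaches the top count.
        combined.append(next(label for label in labels if counts[label] == best))
    return combined


def accuracy(labels, gold):
    """Fraction of positions at which `labels` agrees with `gold`."""
    return sum(a == b for a, b in zip(labels, gold)) / len(gold)


gold     = ["DET", "NOUN", "VERB", "DET", "NOUN"]
tagger_a = ["DET", "NOUN", "NOUN", "DET", "NOUN"]   # wrong at position 2
tagger_b = ["DET", "VERB", "VERB", "DET", "NOUN"]   # wrong at position 1
tagger_c = ["DET", "NOUN", "VERB", "NOUN", "NOUN"]  # wrong at position 3

combined = vote([tagger_a, tagger_b, tagger_c])
```

Each toy tagger makes one error in five, but because the errors fall in
different places the vote recovers the gold labels. The gain depends entirely
on the techniques making different mistakes.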

Reusing resources in practice

In common with other software systems, LE components deploy both data 
and process elements. The quality, quantity and availability of shared data
resources have increased dramatically during the late 1980s and 1990s^7.

The sharing of processing (or algorithmic) resources remains more limited
(Cunningham, Freeman, Black 1994), one key reason being that the integration
and reuse of different components can be a major task. For example, the
ESPRIT project PLUS devoted substantial effort to reusing a theorem prover
from IBM's STUF system for parsing an HPSG grammar (Black ed. 1991). The
COBALT project (Rocca, Black, Cunningham, Zarri, Celnik 1993) failed to locate
a reusable shallow analysis engine with a cost-benefit profile for reuse
superior to reimplementation (Black, Cunningham 1993). The CRISTAL project
planned to reuse results from those projects but again platform specificity
had a negative impact (Cunningham, Underwood, Black 1994). We noted above the
increase in scale of the problems that LE research systems aim to tackle. In
parallel with this trend, the overhead involved in creating a full-scale IE
system, for example, is also increasing. For many research groups the costs
are prohibitive. Any method for alleviating the problems of reuse would make
a significant contribution to LE research and development.

^7 Extensive discussion of the repositories (LDC, CLR, MLSR etc.) of corpora
and lexicon resources and their holdings up to 1994 can be found in (Wilks et
al. 1996). More recent developments concerning ELRA (the European Linguistic
Resources Association) can be found in Elsnews 4.5 (November 1995).

On a smaller scale, the typical life-cycle of doctoral research in AI/NLP is:

• have an idea;

• reinvent the wheel, fire and kitchen sinks to provide a framework for the
idea;

• program the idea;

• publish;

• throw the system to the dogs / tape archivist / shelfware catalogue.

A framework which enabled relatively easy reuse of past work could signifi- 
cantly increase research productivity in these cases. 

Nuts and bolts 

With the increasing scale of LE systems, software engineering issues become 
more important. Just as the construction of the Severn Bridge was a rather 
different order of problem from that of laying a couple of planks across a 
farmland ditch, the development of software capable of processing megabytes of 
text, written by idiosyncratic wetware^8, in short periods of time to measurable
levels of accuracy is a quite different game from that, say, of providing natural 
language interaction for the control of a robot arm that moves blocks on a 
table top (Winograd 1972). The nuts and bolts are a lot bigger, and may 
even be of a completely different fabric altogether. This type of issue has 
been solved successfully in other areas of computer science, e.g. databases. 
Failure to address software-level robustness (as opposed to the robustness of 
the underlying NLP technology), quality and efficiency will be a barrier to 
transferring LE technology from the lab to marketplace. 

Some other requirements relating to the technological foundations of these
systems also arise. Module interchangeability (at both the data and process
levels), a kind of 'software lego' or 'plug-and-play', would allow users to buy
into LE technology without tying them to one supplier. (In a different domain
this was the message of the Open Systems movement. Let's hope we don't
emulate their success!) Also desirable are easy upgrade routes as technology
improves. In addition to the reasons noted above, precise quantification of
performance measures is also needed to foster confidence in the capability of
LE applications to deliver, and robustness and efficiency for large text
volumes are prerequisites for many applications. Software multilinguality and
operating system independence are also issues. Finally, maximising
cross-domain portability will favourably impact delivery costs. (See
(Grishman 1995) for a similar review of these points.)

^8 Journalists.

Gridlock on the super-highway 

Our discussion of trends in LE concludes with two major LE application
areas, Information Extraction (IE) and Machine Translation (MT), which both
exhibit the trends discussed above.

Recent years have seen significant improvements in the quality and robustness
of LE technology. Rapid improvement in robustness (the ability to deal with
any input) and quality is evident in the leading systems (Jacobs ed. 1992;
ARPA 1992; ARPA 1993b; Strzalkowski, Scheyen 1993; Magerman 1995). In
this year's MUC-6 competition (Sundheim 1995a,b; Onyshkevych 1995a,b)
initial results indicate that named-entity recognition can now be performed
by machines to performance levels equal to those of people (ARPA 1996; Wakao,
Gaizauskas 1995). The result is that applications of the technology to
large-scale problems are increasingly viable.

IE is intended to deal with the rapidly growing problem of extracting
meaningful information from the vast amount of electronic textual data that
threatens to engulf us. Scientific journal abstracts, financial newswires,
patents and patent abstracts, corporate and government technical
documentation, electronic mail and electronic bulletin boards all contain a
wealth of information of vital economic, social, scientific and technical
importance. The problem is that the sheer volume of these sources is
increasingly preventing the timely finding of relevant information, a state of
affairs exacerbated by the explosive growth of the Internet (Krol 1994;
Thompson 1995b).

Existing information retrieval (Salton 1989) solutions to this problem are a 
step in the right direction, and the industry supplying IR applications can 
expect to continue in its current healthy state. 

IR systems, however, attempt no analysis of the meaningful content of texts. 
This is a strength of the approach, leading to robustness and speed, but also 
a weakness, as the information represented by the texts is retrieved in the 
format of the texts themselves - i.e. in the ambiguous and verbose medium of 
natural language. 

Extraction of information in definite formats is an obvious solution and one 
which can only be achieved through the application of LE technology. 

The IE community have been leaders in quantitative evaluation (ARPA 1993b). 
Statistical methods are widely used, but so is more conventional CL (ARPA 
1993). The need for systematic reuse of both data and processing resources 
has been recognised, and work funded to facilitate this, and the importance of 
software engineering matters noted (TIPSTER 1994). 

A similar situation is evident in MT research. Nyberg, Frederking, Farwell,
Wilks (1994) note the continuing importance of evaluation for MT; Nirenburg, 
Mitamura, Carbonell (1994) propose that the multi-approach, multi-paradigm 
nature of the field be embodied in 'multi-engine', or 'adaptive' MT systems. 

3 GATE



The previous section argued that the new name, Language Engineering, re- 
flects changes of territory for natural language R&D, and drew out a set of 
requirements for LE systems. We believe that these requirements may best be 
met by the provision of dedicated support software for researchers and applica- 
tions developers, in the form of an architecture and development environment, 
and we have developed an initial version of such a system, called GATE - a 
General Architecture for Text Engineering. 

GATE is an architecture in the sense of providing a common infrastructure 
for building LE systems. An analogy is the hardware architecture of a PC: 
provided a manufacturer of, say, multi-media controller cards follows the
published specification of the PC bus, BIOS etc., the card should work in any
machine. Further, the card should be able to rely on certain common services 
provided by the PC architecture. 

GATE is a development environment because it provides a variety of data vi- 
sualisation, debugging and evaluation tools (with point-and-click interface), 
and a set of standardised interfaces to reusable components. It supports the 
development of LE systems in a way analogous to the support for program 
development provided by compilers, libraries, debuggers and syntax-aware ed- 
itors. 

GATE will be available free for research purposes, and is intended to grow and 
develop in response to the needs of the UK and European LE communities. 
Its design incorporates results from related European and US initiatives and 
bridges the infrastructural work of the two. 

The rest of this section: 
• casts the constraints on LE systems identified in section 2 as
requirements for GATE;

• gives an overview of the architecture in the context of the requirements;

• describes the arrangements for collaborative work on IE using the initial 
distribution of GATE; 

• gives a roadmap of future development of the system. 

More detail on GATE can be found in Appendix 0, and on related work in 
Appendix 0. 

Summary of requirements 

A general architecture for LE R&D should: 

• support collaborative research; 

• support hybrid systems, 'plug-and-play' module interchangeability, and 
easy upgrading; 

• support the reuse of existing and future algorithmic components and data 
resources, whether they be the results of PhD projects or multinational 
strategic initiatives; 

• contribute to software-level robustness, quality and efficiency; 

• contribute to portability across problem domains and application areas; 

• support comparative evaluation, preferably at lower cost than the US 
programmes and without stifling innovation; 

• contribute to software portability across languages and across operating 
systems and programming languages. 

Architecture overview



GATE presents LE researchers or developers with an environment where they 
can use tools and linguistic databases easily and in combination, launch dif- 
ferent processes, say taggers or parsers, on the same text and compare the 
results, or, conversely, run the same module on different text collections and 
analyse the differences, all in a user-friendly interface. Alternatively, module 
sets can be strung together to make e.g. IE, IR or MT systems. Modules 
and systems can be evaluated (using e.g. the Parseval tools), reconfigured and 
reevaluated - a kind of edit/compile/test cycle for LE components.




[Figure: diagram showing GDM, CREOLE and GGI]

GDM - the GATE Document Manager
GGI - the GATE Graphical Interface
CREOLE - a Collection of REusable Objects for Language Engineering

Figure 1: The three elements of GATE

GATE comprises three principal elements (figure 1):

• a database for storing information about texts and a database schema
based on an object-oriented model of information about texts (the GATE
Document Manager - GDM);

• a graphical interface for launching processing tools on data and viewing
and evaluating the results (the GATE Graphical Interface - GGI);

• a collection of wrappers for algorithmic and data resources that
interoperate with the database and interface and constitute a Collection of
REusable Objects for Language Engineering - CREOLE.



GDM is based on the TIPSTER document manager, and the initial
implementation was supplied by the Computing Research Lab at New Mexico State
University (whose help we gratefully acknowledge). It is planned to enhance
the SGML capabilities of this model, possibly by exploiting the results of the 
MULTEXT project (we thank colleagues from ISSCO and Edinburgh for mak- 
ing available documentation and advice on this subject). See Appendix ^ for 
details of the relationship between GATE and these (and other) projects. 

GDM provides a central repository or server that stores all information an LE
system generates about the texts it processes. All communication between
the components of an LE system goes through GDM, insulating parts from
each other and providing a uniform API (applications programmer interface)
for manipulating the data produced by the system.^9 Benefits of this approach
include the ability to exploit the maturity and efficiency of database
technology, easy modelling of blackboard-type distributed control regimes (of
the type proposed by Boitet, Seligman 1994, and in the section on control in
Black ed. 1991) and reduced interdependence of components.
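
The idea of modules communicating only through a central store can be sketched
as follows. This is a minimal illustration under stated assumptions, not the
GDM or TIPSTER API: the names `DocumentManager`, `Annotation`,
`add_annotation` and `get_annotations` are hypothetical, and the annotation
record (a kind, start and end character offsets, and attributes) is an assumed
span-based representation chosen for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Annotation:
    kind: str            # e.g. "token", "capitalised"
    start: int           # character offset where the span begins
    end: int             # character offset just past the span
    attributes: dict = field(default_factory=dict)


class DocumentManager:
    """Hypothetical central store: texts plus everything modules say about them."""

    def __init__(self):
        self._texts = {}
        self._annotations = {}

    def add_document(self, doc_id, text):
        self._texts[doc_id] = text
        self._annotations[doc_id] = []

    def get_text(self, doc_id):
        return self._texts[doc_id]

    def add_annotation(self, doc_id, annotation):
        self._annotations[doc_id].append(annotation)

    def get_annotations(self, doc_id, kind=None):
        # Returns a new list, so callers may add annotations while iterating.
        return [a for a in self._annotations[doc_id]
                if kind is None or a.kind == kind]


def tokeniser(gdm, doc_id):
    """Module 1: writes one 'token' annotation per whitespace-separated word."""
    text, pos = gdm.get_text(doc_id), 0
    for word in text.split():
        start = text.index(word, pos)
        pos = start + len(word)
        gdm.add_annotation(doc_id, Annotation("token", start, pos))


def capitalised_word_finder(gdm, doc_id):
    """Module 2: reads tokens from the store; it never calls the tokeniser."""
    text = gdm.get_text(doc_id)
    for tok in gdm.get_annotations(doc_id, "token"):
        if text[tok.start].isupper():
            gdm.add_annotation(doc_id, Annotation("capitalised", tok.start, tok.end))


gdm = DocumentManager()
gdm.add_document("d1", "GATE was built in Sheffield")
tokeniser(gdm, "d1")
capitalised_word_finder(gdm, "d1")
found = [gdm.get_text("d1")[a.start:a.end]
         for a in gdm.get_annotations("d1", "capitalised")]
```

Because the second module depends only on the stored token annotations, any
other tokeniser that writes the same kind of record could be substituted
without changing it.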

GGI is in development at Sheffield. It is a graphical launchpad for LE sub- 
systems, and provides various facilities for viewing and testing results and 
playing software lego with LE components: interactively stringing objects into 
different system configurations. 

All the real work of analysing texts (and maybe producing summaries of them,
or translations, or SQL statements...) in a GATE-based LE system is done
by CREOLE modules.

^9 Where very large data sets need passing between modules, other external
databases can be employed if necessary.

Note that we use the terms module and object rather loosely to mean interfaces
to resources which may be predominantly algorithmic or predominantly data, 
or a mixture of both. We exploit object-orientation for reasons of modularity, 
coupling and cohesion, fluency of modelling and ease of reuse (see e.g. Booch 
1994). 

Typically, a CREOLE object will be a wrapper around a pre-existing LE 
module or database - a tagger or parser, a lexicon or ngram index, for example. 
Alternatively objects may be developed from scratch for the architecture - in 
either case the object provides a standardised API to the underlying resources 
which allows access via GGI and I/O via GDM. The CREOLE APIs may also 
be used for programming new objects. 
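
The wrapper idea can be sketched as follows. Everything here is hypothetical:
`legacy_tagger` stands in for a pre-existing tool with its own idiosyncratic
output format, `CreoleObject` and its `run` method are invented names rather
than the CREOLE API, and a plain dictionary stands in for GDM.

```python
def legacy_tagger(sentence):
    """Stand-in for a pre-existing tool: takes a string, returns 'word/TAG' text."""
    lexicon = {"the": "DET", "parser": "NOUN", "fails": "VERB"}
    return " ".join(f"{w}/{lexicon.get(w.lower(), 'UNK')}" for w in sentence.split())


class CreoleObject:
    """Hypothetical uniform interface: every wrapped resource exposes run()."""

    name = "unnamed"

    def run(self, store, doc_id):
        raise NotImplementedError


class TaggerWrapper(CreoleObject):
    """Adapts the legacy tagger's string format to records in the shared store."""

    name = "legacy-tagger"

    def run(self, store, doc_id):
        doc = store[doc_id]
        raw = legacy_tagger(doc["text"])          # native format of the tool
        for item in raw.split():
            word, tag = item.rsplit("/", 1)       # translate to the common format
            doc["annotations"].append(
                {"type": "pos", "word": word, "tag": tag, "source": self.name})


# A dictionary stands in for the document manager in this sketch.
store = {"d1": {"text": "The parser fails", "annotations": []}}
TaggerWrapper().run(store, "d1")
tags = [(a["word"], a["tag"]) for a in store["d1"]["annotations"]]
```

The wrapped tool is left untouched. Only the wrapper knows its native format,
so callers see the same `run` interface whatever resource lies underneath.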

The initial release of GATE will be delivered with a CREOLE set comprising 
a complete MUC-compatible IE system (to begin with, more of a pidgin than 
a Creole!). Some of the objects will be based on freely available software 
(e.g. the Brill tagger (Brill 1994)), while others are derived from Sheffield's 
MUC-6 entrant, LaSIE^10 (Gaizauskas, Humphreys, Wakao, Cunningham 1995;
Gaizauskas, Humphreys, Wakao, Cunningham 1996). This set is called VIE 
- a Vanilla IE system. See section ^ for an overview. CREOLE will expand 
quite rapidly during 1996, to cover a wide range of LE R&D components (such 
as those currently available at the ACL-sponsored Natural Language Software 
Registry at DFKI^11), but for the rest of this section we'll use IE as an example
of the intended operation of GATE. 
^10 Large-Scale IE.

^11 URL: http://cl-www.dfki.uni-sb.de/cl/registry/draft.html
The recent MUC competition, the sixth, defined four IE tasks to be carried
out on Wall Street Journal articles. Sheffield's system did well, scoring in
the middle of the pack in general and doing as well as the best systems in
some areas. Developing this system took 24 person-months, one significant
element of which was coping with the strict MUC output specifications. What
does a research group do if it lacks the resources to build such a large
system, or if, even with the resources, it would not want to spend effort on
areas of language processing outside its particular specialism? The answer
until now has been that these groups cannot take part in large-scale system
building, thus missing out on the chance to test their technology in an
application-oriented environment and, perhaps more seriously, missing out on
the extensive quantitative evaluation mechanisms developed in areas such as
MUC. In GATE and VIE we hope to provide an environment where groups can mix
and match elements of MUC technology from other sites (including ours) with
components of their own, thus allowing the benefits of large-scale systems
without the overheads. A parser developer, for example, can replace the parser
supplied with VIE.

Licensing restrictions preclude the distribution of MUC scoring tools with
GATE, but Sheffield will arrange for evaluation of data produced by other
sites. In this way, GATE/VIE will support comparative evaluation of LE
components at a lower cost than the ARPA programme (partly by exploiting their
work, of course!). Because of the relative informality of these evaluation
arrangements, and as the range of evaluation facilities in GATE expands beyond
the four IE tasks of the current MUC, we should also be able to offset the
tendency of evaluation programmes to dampen innovation.

Similarly we aim to make collaboration between research groups much easier.
Sites specialising in different LE subtasks can combine their efforts into
bigger application-oriented systems with minimal overhead. We hope that we can
help the community squeeze a little more research time out of
industrially-oriented projects by cutting down on the time spent integrating
research work into demonstrator systems.

Working with GATE/VIE, the researcher will from the outset reuse existing
components, the overhead for doing so being much lower than is conventionally
the case - instead of learning new tricks for each module reused, the common
APIs of GDM and CREOLE mean only one integration mechanism must be
learnt. And as CREOLE expands, more and more modules and databases
will be available at low cost. We also endorse object orientation (OO) in this
context, as an enabling technology for reuse (Booch 1994), and hope to move
towards sub-component level reuse at some future point, possibly providing
C++ libraries as part of an OO LE framework (Cunningham, Freeman, Black
1994).

As we built our MUC system it was often the case that we were unsure of 
the implications for system performance of using tagger X instead of tagger 
Y, or gazetteer A instead of pattern matcher B. In GATE, substitution of
components is a point-and-click operation in the GGI interface. (Note that 
delivered systems, e.g. EC project demonstrators, can use GDM and CREOLE 
without GGI - see below.) This facility supports hybrid systems, ease of 
upgrading and open systems-style module interchangeability. 
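
The kind of comparison described above (tagger X against tagger Y, with
everything else held fixed) can be sketched as below. The two toy taggers, the
`run_system` and `accuracy` functions and the gold data are all hypothetical;
in GATE itself the substitution is made through GGI rather than in code.

```python
def tagger_x(tokens):
    """Toy tagger: everything is a noun."""
    return ["NOUN" for _ in tokens]


def tagger_y(tokens):
    """Toy tagger: small lexicon, nouns by default."""
    lexicon = {"the": "DET", "a": "DET", "saw": "VERB"}
    return [lexicon.get(t.lower(), "NOUN") for t in tokens]


def run_system(text, tagger):
    """A two-stage system whose tagging stage is a replaceable component."""
    tokens = text.split()          # stage 1: fixed tokeniser
    return tagger(tokens)          # stage 2: whichever tagger is plugged in


def accuracy(predicted, gold):
    """Fraction of positions at which the prediction matches the gold label."""
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)


text = "The linguist saw a whale"
gold = ["DET", "NOUN", "VERB", "DET", "NOUN"]

# Same text, same scoring, different component: only the tagger changes.
results = {name: accuracy(run_system(text, tagger), gold)
           for name, tagger in [("X", tagger_x), ("Y", tagger_y)]}
```

Holding the rest of the system and the scoring fixed means that any difference
in the two figures can be attributed to the substituted component.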

Of course, GATE does not solve all the problems involved in plugging diverse 
LE modules together. There are two barriers to such integration: 

• incompatibility of representation of information about text and the
mechanisms for storage, retrieval and inter-module communication of that
information;

• incompatibility of type of information used and produced by different
modules.

GATE enforces a separation between these two and provides a solution to 
the former based on the work of the TIPSTER architecture group (TIPSTER 
1994). Because GATE places no constraints on the linguistic formalisms or 
information content used by CREOLE objects, the latter problem must be 
solved by dedicated translation functions - e.g. tagset-to-tagset mapping - 
and, in some cases, by extra processing - e.g. adding a semantic processor 
to complement a bracketing parser in order to produce logical form to drive a 
discourse interpreter. As more of this work is done we can expect the overhead 
involved to fall, as all results will be available as CREOLE objects. In the early 
stages Sheffield will provide some resources for this work in order to get the 
ball rolling, i.e. we will provide help with CREOLEising existing systems 
and with developing interface routines where practical and necessary. We are 
confident that integration is possible (partly because we believe that differences 
between representation formalisms tend to be exaggerated) - and others share 
this view, e.g. the MICROKOSMOS project (Beale, Nirenburg, Mahesh 1995), 
which seeks to integrate many types of knowledge source in a useable whole, 
as well as the LexiCadCam experience at New Mexico (Wilks, Guthrie, Slator 
1996) which sought to provide core lexical information as needed in a range of 
user-specified formats. 
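
A tagset-to-tagset mapping of the kind mentioned above might look like the
following sketch. The mapping table is a small hypothetical fragment (the
fine-grained tags follow Penn Treebank conventions, the coarse ones are
invented for the example), and `map_tags` is an illustrative name, not part of
GATE.

```python
# Hypothetical fragment of a mapping from a fine-grained tagset to a coarse one.
FINE_TO_COARSE = {
    "NN": "NOUN", "NNS": "NOUN", "NNP": "NOUN",
    "VB": "VERB", "VBD": "VERB", "VBZ": "VERB",
    "DT": "DET", "JJ": "ADJ",
}


def map_tags(tagged, mapping, unknown="X"):
    """Translate (word, tag) pairs from one tagset to another.

    Tags missing from the mapping are replaced by `unknown` rather than
    passed through unchanged, so that gaps in the table are visible downstream.
    """
    return [(word, mapping.get(tag, unknown)) for word, tag in tagged]


parser_input = map_tags(
    [("The", "DT"), ("whales", "NNS"), ("sang", "VBD"), ("loudly", "RB")],
    FINE_TO_COARSE,
)
```

The unmapped tag `RB` comes out as `X`, which shows where the translation table
would need extending before the two modules could be combined reliably.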

GATE is also intended to benefit the LE system developer (which may be the 
LE researcher with a different hat on, or industrialists implementing systems 
for sale or for their own text processing needs). Using GATE for the delivery 
of a system is illustrated in figure 2. A delivered system comprises a set of
CREOLE objects, the GATE runtime engine (GDM and associated APIs) and
a custom-built interface (maybe just character streams, maybe a Visual Basic
Windows GUI, . . . ). The interface might reuse code from GGI, or might be
developed from scratch.

[Figure 2: Delivering systems with GATE. Diagram labels: "Creates / reuses";
"Deliverable LE system".]



The LE user may upgrade by swapping parts of the CREOLE set if better
technology becomes available elsewhere. This model for the commercialisa-
tion of LE technology is already beginning to operate in the US, where a
number of organisations are preparing TIPSTER-compatible modules for sale
or distribution for research. (These organisations include NMSU, SRA, HNC,
University of Massachusetts, Paracell, Logicon (Dunning 1995, personal com-
munication).) All TIPSTER-compatible modules will also work with GATE
as GATE itself is designed to be a TIPSTER-compatible system. Thus the
pool of easily reusable LE resources available to researchers and developers
using GATE has the potential to become a large, rich set of modules from a
good proportion of the LE community world-wide. Also, it may well become
the case that organisations purchasing LE software will require TIPSTER
compatibility (this will be true of US government organisations, for example).

At the software engineering level GATE:

• contributes to robustness and quality by providing a mature infrastruc-
ture;

• contributes to efficiency via the design of the TIPSTER text model and 
by access to fast database technology underlying GDM. 

As regards operating system independence, GDM and GGI will initially be 
available for Linux, SunOS, Solaris 2 and other UNIX platforms as required, 
but will avoid using UNIX-specific facilities. A Windows version may follow 
at some point (see the roadmap section below). CREOLE portability is more
difficult. GATE places no constraints on the implementation languages and 
platforms of CREOLE objects, so they may or may not be portable. 

GATE cannot eliminate the overheads involved with porting LE systems to
different domains (e.g. from financial news to medical reports). Tuning LE
system resources to new domains is a current research issue (see also: the
LRE DELIS and ECRAN projects; Evans, Kilgariff 1995). The modularity of
GATE-based systems should, however, contribute to cutting the engineering
overhead involved.

Collaboration using GATE and VIE 

Sheffield will support collaborative work using GATE/VIE for LE research
groups (typically academic groups), businesses with IE needs and producers 
of lexicons and dictionaries. The three groups are technology, data and re- 
source providers respectively, contributing CREOLE modules, test data (e.g. 
manually extracted information and the relevant source texts) and machine- 
readable language resources (e.g. dictionaries). The projected benefits for 
participants include: 

• comparative quantitative evaluation of candidate technologies for IE;

• technology providers may specialise in components of the IE task, avoid-
ing the overhead of providing a complete IE system while still working
within the framework of a complete NLP application;

• data providers (typically industrial concerns) get access to IE technology 
applied to their particular textual problem domains; 

• resource providers can assess the performance of their products and in- 
crease the market for them by encouraging their use in applied LE sys- 
tems. 

Note that there will be no requirement to supply source code for contributed
modules, and that intellectual property and other rights will be safeguarded
by appropriate legal agreements.

Roadmap 

Our first goal for GATE is to provide a prototype of the architecture along 
with a set of CREOLE objects for doing MUC-style information extraction. 
GATE/VIE 1.0 will be available at the start of 1996 to research groups and 
development projects who wish to participate in IE systems development. We 
will at that point solicit contributions of CREOLE replacements for VIE mod-
ules, and data sets from organisations with IE needs.

Initial versions will run under UNIX and X11 only, and support the gcc C[++]
compilers and Tcl 7.4 / Tk 4.0 (Ousterhout 1994) and higher.

Subsequent developments will concentrate on expanding the set of CREOLE 
objects in order to: 

• exemplify the use of GATE in other application areas, e.g. MT, CALL,
Speech research;

• make GATE a standard resource repository via CREOLE wrappers for
LE resources like lexicons, grammars etc. (possibly in collaboration with
the newly-formed European Language Resources Association).

Sheffield will contribute resources to integration of other sites' components to 
start with. 

On the technical side issues include: 

• 16 bit character support; 

• internationalisation of system messages to increase ease of use for non- 
native English speakers (probably taking into account the results of LRE 
project 61-003 GLOSSASOFT (Hudson 1995)); 

• evaluation and revision of the GGI interface, and the addition of further 
data visualisation and debugging facilities; 

• SGML support; 

• portability to other platforms. 

We envisage considerable input from other research groups and welcome criti-
cism and comment on our implementation (and offers of work!).

4 VIE, a Vanilla Information Extraction system

As originally envisaged (Wilks, Gaizauskas 1994), GATE will be distributed 
with a set of CREOLE objects that together implement a complete information 
extraction system capable of producing results compatible with the MUC-6 
task definitions. This CREOLE set is called VIE, a Vanilla IE system, and it
is intended that participating sites use VIE as the basis for specialising on sub-
tasks in IE. By replacing a particular VIE module - the parser, for example
- a participating group will immediately be able to evaluate their specialist 
technology's potential contribution to full-scale IE applications. Sheffield has 
access to the MUC-6 scoring tools (and the PARSEVAL software) and will 
run periodic evaluations of various VIE-based configurations. 

It is envisaged that LE research groups (typically academic research groups) 
supply modules to replace parts of VIE. Businesses with IE needs can also 
participate in the programme by contributing test sets and task definitions. 
Resource builders like dictionary publishers will be approached to supply re- 
search versions of their online texts. 

The most recent MUC competition, MUC-6, defined four tasks to be carried 
out on Wall Street Journal articles: 

• named entity (NE) recognition, the recognition and classification of def- 
inite entities such as names, dates, places; 

• coreference (CO) resolution, the identification of identity relations be- 
tween entities (including anaphoric references to them); 

• template element (TE) construction, a fixed-format, database-like enu- 
meration of organisations and persons; 

• scenario template (ST) construction, the detection of specific relations 
holding between template elements relevant to a particular information 
need (in this case personnel joining and leaving companies) and con- 
struction of a fixed-format structure recording the entities and details of 
the relation. 

VIE is an integrated system that builds up a single, rich model of a text which 
is then used to produce outputs for all four of the MUC-6 tasks. Of course 
this model may also be used for other purposes aside from MUC-6 results
generation; for example, we currently generate natural language summaries of
the MUC-6 scenario results.

Put most broadly, and superficially, VIE's approach involves compositionally 
constructing semantic representations of individual sentences in a text accord- 
ing to semantic rules attached to phrase structure constituents which have 
been obtained by syntactic parsing using a corpus-derived context-free gram- 
mar. The semantic representations of successive sentences are then integrated 
into a 'discourse model' which, once the entire text has been processed, may 
be viewed as a specialisation of a general world model with which the system 
sets out to process each text. 

Features which distinguish the system are: 

• an integrated approach allowing knowledge at several linguistic levels to 
be applied to each MUC-6 task (e.g. coreference information is used in 
named entity recognition); 

• the absence of any overt lexicon - lexical information needed for parsing 
is computed dynamically through part-of-speech-tagging and morpho- 
logical analysis; 

• the use of a grammar derived semi-automatically from the Penn Tree-
Bank corpus;

• the use and dynamic acquisition of a world model, in particular for the 
coreference and scenario tasks; 

• a summarisation module which produces a brief natural language sum- 
mary of scenario events. 

VIE will be available with the initial release of GATE.

5 Conclusion



We have argued that the language processing field is in a state of rapid change, 
and that the focus is shifting to large-scale applications and systems that are 
beginning to produce marketable solutions. The new emphasis has generated 
a new name - Language Engineering. 

We suggest that a new approach to software support for LE R&D should be 
developed to parallel this shift. We have proposed an architectural solution - 
GATE - grounded on previous work in the area. 

GATE aims to be a standard architecture for LE systems. Standards, of 
course, must sell themselves - imposition rarely works (whether in computer 
science or in real life!). We hope that the LE community will endorse our
argument for an LE support architecture, and that our implementation will
be strong enough to fulfil the promise of the idea.

APPENDICES

A Related work

We discuss three systems here, ALEP, MULTEXT and the TIPSTER Archi- 
tecture. 

ALEP 

ALEP (Simpkins 1992) is an EC project which aims to provide an Advanced 
Language Engineering Platform - superficially a similar goal to ours. The 
approaches are quite different, however. ALEP is an advanced system for
developing and manipulating feature structure knowledge-bases under unifi-
cation. Also provided are several parsing algorithms, algorithms for transfer,
synthesis and generation (Schütz 1994). As such it is a system for develop-
ing particular types of data resource and for doing a particular set of tasks
in LE in a particular way. ALEP does not aim for complete genericity (or it
would be in the business of providing algorithms for Baum-Welch estimation,
or fast regular expression matching, or ...). Supplying a generic system to
do every LE task is clearly impossible, and prone to instant obsolescence in a 
rapidly changing field. GATE, in contrast, is a shell, a backplane into which 
the whole spectrum of LE modules and databases can be plugged. Compo- 
nents used within GATE will typically exist already - our emphasis is reuse, 
not reimplementation. Our project is to provide a flexible and efficient way 
to combine LE components to make LE systems (whether experimental or for 
delivered applications) - not to provide 'the one true system', or even 'the 
one true development environment': ALEP-based systems might well provide 
components operating within GATE. 

The ALEP enterprise, then, is orthogonal to ours - there is no significant
overlap or conflict.



MULTEXT 



MULTEXT (Ballim 1995; Thompson 1995b; Finch, Thompson, McKelvie 
1995) is another EC project, whose objective is to produce tools for multi- 
lingual corpus annotation and sample corpora marked-up according to the 
same standards used to drive the tool development. Annotation tools under 
development perform text segmentation, POS tagging, morphological analysis 
and parallel text alignment. The project has defined an architecture centred 
on a model of the data passed between the various phases of processing im-
plemented by the tools.

MULTEXT is based on SGML, the Standard Generalised Markup Language
(Goldfarb 1990). SGML works by adding extra information to texts in a 
standard format. For example, the (rather short) news article 

Reuter

Dog bites man. 
Newshound implicated. 

might appear in SGML as

<DOC>
<HEADERS>Reuter</HEADERS>
<SENT>Dog bites man.</SENT>
<SENT>Newshound implicated.</SENT>
</DOC>

Markup is between chevrons, '<' and '>'; slashes signify the end of a marked-
up entity. The language is information-neutral (the tags 'DOC', 'SENT' etc.
are not part of the language definition) and is encoded in whatever character 
set the source text originates in (e.g. ASCII). 

The MULTEXT architecture is based on a commitment to TEI-style (the Text 
Encoding Initiative (Sperberg-McQueen, Burnard 1994)) SGML encoding of
information about text. The TEI defines standard tag sets for a range of 
purposes including many relevant to LE systems. Tools in a MULTEXT sys- 
tem communicate via interfaces specified as SGML document type definitions 
(DTDs - essentially tag set descriptions), using character streams on pipes - 
an arrangement modelled after UNIX-style shell programming. This UNIX 
flavour was also apparent in provision for record/field encoding (records = 
lines of text; fields = whitespace-separated character groups) of markup as
an interchangeable alternative to SGML (though this was dropped from later
versions of the tools), and in provision of an SGML-aware version of sed, the 
UNIX pattern search and replace tool. Organisational problems have, unfor- 
tunately, led to an early termination of the project, but the tools and the 
architecture they run in will still be completed and distributed for research 
purposes. 

MULTEXT endorses the view that SGML is an appropriate and flexible lan- 
guage for the splitting and recombination of text analysis elements. A tool 
selects what information it requires from its input SGML stream and adds 
information as new SGML markup. An advantage here is a degree of data- 
structure independence: so long as the necessary information is present in its 
input, a tool can ignore changes to other markup that inhabits the same stream 
- unknown SGML is simply passed through unchanged. A disadvantage is that 
although graph-structured data may be expressed in SGML, doing so is com- 
plex (either via concurrent markup, the specification of multiple legal markup 
trees in the DTD, or by rather ugly nesting tricks to cope with overlapping, 
aka "milestone tags"). Graph-structured information might be present in the 
output of a parser, for example, representing competing analyses of areas of 
text. 

Another feature of MULTEXT is a set of abstract data types (ADTs) for all 
tool I/O (Ballim 1995) supported by a single shared API (Application Pro- 
gram(mers') Interface) for access to the types. An executive (the tool shell) 
glues tools together in particular configurations according to user specifica-
tions. The shell may extract sub-trees from SGML documents to reduce the
I/O load where tools only require a subset of a marked-up document. 

The ADT set forms an object-oriented model (OO in the sense of using
inheritance and data encapsulation) of the data present in a marked-
up document. Example classes include Sentence, SentenceBlock (sequence
of sentences), LexicalWord (word plus definition from a lexicon). The ADT
model reflects the type of processing available in the tool set - there is a type
TaggedSentence, for example, but not a ParsedSentence.

Finally, MULTEXT has developed some general support infrastructure for 
handling SGML and for parallelising tool pipelines. A query language for 
accessing components of SGML documents is defined and an API in support
of this language is provided. For example a program might specify parts of a
document by the pattern DOC/*/s which refers to all <s> objects under <DOC> 
tags - all SGML-marked sentences in the document. Additionally SGML- 
aware versions of various UNIX utilities are in development. Parallel execution 
may be supported at the level of single tools via a program that distributes 
pipelined operations over a set of networked machines. 
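
The effect of a path query such as DOC/*/s can be illustrated with a short
Python sketch. It is an analogy only, not the MULTEXT query language or its
C API: we assume that the pattern means "every <s> element at any depth
below <DOC>", and we approximate it with the limited XPath support of the
standard xml.etree.ElementTree module. The sample document (including the
BODY element) and the function name sentences_under_doc are invented for
the example.

```python
import xml.etree.ElementTree as ET

# Well-formed XML standing in for an SGML document; real SGML allows
# constructs (e.g. omitted end tags) that this parser would reject.
DOCUMENT = (
    "<DOC>"
    "<HEADERS>Reuter</HEADERS>"
    "<BODY><s>Dog bites man.</s><s>Newshound implicated.</s></BODY>"
    "</DOC>"
)


def sentences_under_doc(markup):
    """Return the text of every <s> element below a <DOC> root."""
    root = ET.fromstring(markup)
    if root.tag != "DOC":
        return []
    # ".//s" selects <s> descendants at any depth below the root.
    return [s.text for s in root.findall(".//s")]


found = sentences_under_doc(DOCUMENT)
```

A tool written this way depends only on the <s> elements being present, which
is the data-structure independence noted earlier: other markup in the stream
can change without affecting it.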

MULTEXT is implemented for the UNIX platform. Access to tools is as uni- 
tary programs and via the tool shell; the SGML query language is supported 
by a C API. The consortium has declared an intention to make implemen- 
tations generally available, and although the project is finishing early due to 
logistic difficulties, most tools and the support shell will continue development 
at ISSCO for release early in 1996. 

Summary: 

• MULTEXT tools operate on SGML streams. 

• An object model of the data in those streams is defined, along with 

• an API to access the data. 

• An API and query language for accessing components of SGML docu- 
ments is provided along with 

• various useful SGML-aware tools.

TIPSTER II



The TIPSTER programme in the US, currently in its second phase, has also 
produced a data-driven architecture for NLP systems (Grishman, Dunning, 
Callan 1995; TIPSTER Architecture Committee 1994). Like MULTEXT, 
TIPSTER addresses specific forms of language processing, in this case in- 
formation extraction and document detection (or information retrieval - IR). 
As will become clear below, however, TIPSTER's approach is not restricted 
to particular NL tasks. 

Whereas in MULTEXT all information about a text is encoded in SGML, 
which is added by the tools, in TIPSTER a text remains unchanged while 
information is stored in a separate database in the form of annotations. An-
notations associate portions of documents (identified by sets of start/end byte
offsets or spans) with analysis information (attributes), e.g.: POS tags; textual
unit type; template element. In this way the information built up about a text
by NLP (or IR) modules is kept separate from the texts themselves. In place of
an SGML DTD an annotation type declaration defines the information present
in annotation sets, for example a set of values for MUC-style organisation
template elements. Figure 3 shows an example from (Grishman, Dunning,
Callan 1995). SGML I/O is catered for by API calls to import and export
SGML-encoded text.
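
The annotation idea described above can be made concrete with a minimal
Python sketch. It is not the TIPSTER API: the class names Annotation and
Document, the method names and the attribute key "pos" are all assumptions
made for illustration, and each annotation carries a single span where
TIPSTER allows a set of spans.

```python
from dataclasses import dataclass, field


@dataclass
class Annotation:
    """Stand-off annotation: a typed span of byte offsets plus attributes."""
    kind: str
    start: int          # offset of the first byte of the span
    end: int            # offset one past the last byte of the span
    attributes: dict = field(default_factory=dict)


@dataclass
class Document:
    """The source text is stored once and is never modified."""
    content: bytes
    annotations: list = field(default_factory=list)

    def annotate(self, kind, start, end, **attributes):
        if not 0 <= start <= end <= len(self.content):
            raise ValueError("span lies outside the document")
        self.annotations.append(Annotation(kind, start, end, attributes))

    def text_of(self, annotation):
        # ASCII is assumed here so that decoding a byte slice is always safe.
        return self.content[annotation.start:annotation.end].decode("ascii")

    def select(self, kind):
        return [a for a in self.annotations if a.kind == kind]


doc = Document(b"Dog bites man.")
doc.annotate("token", 0, 3, pos="NN")
doc.annotate("token", 4, 9, pos="VBZ")
doc.annotate("token", 10, 13, pos="NN")
doc.annotate("sentence", 0, 14)

tokens = [(doc.text_of(a), a.attributes["pos"]) for a in doc.select("token")]
```

Because annotations only point into the text, overlapping or competing
analyses can coexist without any change to the document itself.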

The definition of annotations in TIPSTER forms part of an object-oriented 
model that deals with inter-textual information as well as single texts. Docu- 
ments are grouped into collections, each with a database storing annotations 
and document attributes such as identifiers, headlines etc. Collections are the 
first-class entities in the architecture. The model also describes elements of
IE and IR systems relating to their use, with classes representing queries and
information needs.

Figure 3: TIPSTER annotations example

The TIPSTER architecture is designed to be portable to a range of operating 
environments, so it does not define implementation technologies. Particular 
implementations make their own decisions regarding issues such as parallelism, 
user interface, or delivery platform. An implementation in C and Tcl (Ouster-
hout 1994) from CRL (the Computing Research Lab, New Mexico State Uni-
versity) implements client-server operation (using Tcl-dp), with a server
database manager fielding requests from client modules.

This implementation is available now and includes both C and Tcl APIs. It
is not currently portable beyond UNIX, though Tcl/Tk is becoming available
on Windows and Macintosh.

The architecture was the result of unpaid collaboration between a large number
of ARPA-supported sites in the US.

Comparison of MULTEXT and TIPSTER 

Both projects propose architectures appropriate for LE, but there are a number 
of significant differences. We discuss seven here, then note the possibility of 
complementary interoperation of the two.

1. MULTEXT adds new information to documents by augmenting an SGML
stream; TIPSTER stores information remotely in a dedicated database.
This has several implications. Firstly, TIPSTER can support documents
on read-only media (e.g. CD-ROMs, which may be used for bulk stor-
age by organisations with large archiving needs, even though access will
then be slower than from hard disk). Secondly, TIPSTER avoids the dif-
ficulties referred to earlier of representing graph-structured information
in SGML. From the point of view of efficiency, the original MULTEXT
model of interposing SGML between all modules implies a generation 
and parsing overhead in each module. Later versions have replaced this 
model with a pre-parsed representation of SGML to reduce this over- 
head. This representation will presumably be stored in intermediate files, 
which implies an overhead from the I/O involved in continually reading 
and writing all the data associated with a document to file. There would 
seem no reason why these files should not be replaced by a database 
implementation, however, with potential performance benefits from the 
ability to do I/O on subsets of information about documents (and from 
the high level of optimisation present in modern database technology). 

2. A related issue is storage overhead. TIPSTER is minimal in this re-
spect, as there is no inherent need to duplicate the source text (which
also means that it works naturally with read-only media like CD-ROMs).
MULTEXT potentially has to duplicate the source text at each interme- 
diary stage, although this might be ameliorated by shifting to a database 
implementation. 

3. TIPSTER's data architecture is process-neutral - the objects in the
model are generic to all information that is associated with definite
ranges of text. (The more concrete aspects of the architecture to do
with IE and IR model the objects involved in user interaction with such
systems.) MULTEXT's model is tool-specific, as noted above (although
the underlying representation language, SGML, is information-neutral).

4. There is no easy way in an SGML-based system to differentiate sets of 
results (i.e. sets of markup) by e.g. the program or user that originated 
them. In general, storing information about the information present in 
an SGML system (or meta-information) is messy. This is a problem for 
MULTEXT but not for TIPSTER. A related point is that TIPSTER 
can easily support multi-level access control via a database's protection 
mechanisms - this is again not straightforward in SGML. 

5. Distributed control is easy to implement in a database-centred system 
like TIPSTER - the DB can act as a blackboard, and implementations
can take advantage of well-understood access control (locking) technol- 
ogy. How to do distributed control in MULTEXT is not obvious. 

6. TIPSTER provides no tools or databases, but many sites are already 
committed to TIPSTER-compatibility, so the set of modules available
in the framework will grow over time. MULTEXT is based around a set 
of tools and reference corpora annotated accordingly. 

7. Working implementations of TIPSTER have been available for some 
months now; MULTEXT will be distributed in 1996.

Interestingly, a TIPSTER system could function as a module in a MULTEXT
system, or vice versa. A TIPSTER storage system could write data in SGML
for processing by MULTEXT tools, and convert the SGML results back into 
native format. Also, the extensive work done on SGML processing in MUL- 
TEXT could usefully fill a gap in the current TIPSTER model, in which SGML 
capability is not fully specified (plans are currently being formed in the US 
to address this problem - input from European experience would seem advis- 
able). Integration of the results of both projects would seem to be the best of 
both worlds, and we hope to achieve this in GATE. 
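
The first half of this round trip, writing stand-off annotations out as inline
markup, might look like the following Python sketch. It is an illustration
under stated assumptions rather than a TIPSTER or MULTEXT facility: the
function names are invented, spans are (tag, start, end) triples over
non-empty ranges of an ASCII string (so character and byte offsets coincide),
and crossing spans are rejected because they cannot be written as properly
nested tags.

```python
def is_properly_nested(spans):
    """True if no two (tag, start, end) spans cross each other."""
    for _, s1, e1 in spans:
        for _, s2, e2 in spans:
            if s1 < s2 < e1 < e2:
                return False
    return True


def to_inline_markup(text, spans):
    """Render properly nested, non-empty spans as inline tags."""
    if not is_properly_nested(spans):
        raise ValueError("crossing spans cannot be written as nested tags")
    indexed = list(enumerate(spans))
    opens, closes = {}, {}
    # Outer spans open first: order by start, then longest first.
    for i, (tag, start, end) in sorted(
            indexed, key=lambda p: (p[1][1], -p[1][2], p[0])):
        opens.setdefault(start, []).append(f"<{tag}>")
    # Inner spans close first: order by end, then latest start first.
    for i, (tag, start, end) in sorted(
            indexed, key=lambda p: (p[1][2], -p[1][1], -p[0])):
        closes.setdefault(end, []).append(f"</{tag}>")
    pieces = []
    for offset in range(len(text) + 1):
        pieces.extend(closes.get(offset, []))
        pieces.extend(opens.get(offset, []))
        pieces.append(text[offset:offset + 1])
    return "".join(pieces)


text = "Dog bites man. Newshound implicated."
first_end = text.index(".") + 1
spans = [("SENT", 0, first_end),
         ("SENT", first_end + 1, len(text)),
         ("DOC", 0, len(text))]
markup = to_inline_markup(text, spans)
```

The rejection of crossing spans is the point at which the graph-structured
information discussed earlier would need concurrent markup or milestone tags.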

Note that we believe that SGML and the TEI must remain central to any 
serious text processing strategy. The points above do not contradict this view, 
but indicate that SGML should not form the central representation format of 
every text processing system. Input from SGML text and TEI-conformant out-
put are becoming increasingly necessary for LE applications as more and more
publishing adopts these standards. This does not mean, however, that flat-file
SGML is an appropriate format for an architecture for LE systems. This ob-
servation is borne out by the fact that TIPSTER started with an SGML/TEI
architecture but rejected it in favour of the current database model, and that 
MULTEXT has gone halfway to this style by passing pre-parsed SGML be- 
tween components. 



B GATE — design and implementation 

Note: this appendix is a preliminary version of design and implementation 
documentation for GATE and VIE. It is a) incomplete and speculative, and
b) repeats some material from earlier sections of the report.

Architecture overview

GATE is based on a combination of the TIPSTER and MULTEXT models. 
The centrepiece of the architecture is a TIPSTER-style document management 
database, chosen for reasons of efficiency, maturity of implementation and 
ease of extensibility. The current TIPSTER model provides hooks for the 
incorporation of SGML support but has not fully developed this aspect of the 
architecture (see above). We plan to capitalise on the work done in MULTEXT 
to augment the SGML capabilities of the TIPSTER architecture, probably via 
the development of a unified API. This unification will not be available in the 
initial release of GATE (early 1996), however. 

GATE will form a bridge between the American work and the European work 
on SGML and conformance to the TEI guidelines and DTDs. 

Components 

GATE comprises three principal components (see figure 4):

GDM — the GATE document manager, based on the TIPSTER document 
manager with added SGML capabilities (and using an implementation 
from CRL at NMSU, whose assistance we gratefully acknowledge); 

GGI — the GATE graphical interface, a development tool for LE R&D, pro- 
viding integrated access to the services of the other components and 
adding visualisation and debugging tools; 

CREOLE — a Collection of REusable Objects for Language Engineering: the 
set of modules integrated with the system. CREOLE comprises wrap- 
pers for existing modules, which may or may not require changing, plus 
modules developed explicitly for GATE compliance. Some objects are 
process-orientated, some data-orientated.

The first distribution of GATE will be configured as a support tool for collabo-
rative R&D in Information Extraction by the inclusion of a CREOLE set that
implements a full-scale MUC-compatible IE system (called VIE, the Vanilla
IE system - see section 4).




GDM - the GATE Document Manager
GGI - the GATE Graphical Interface
CREOLE - a Collection of REusable Objects for Language Engineering

Figure 4: The three elements of GATE

MULTEXT integration will involve: 

• creating CREOLE object wrappers for the tool set; 

• providing SGML I/O and SGML manipulation via an API based on
the MULTEXT work. 

Integrating CREOLE objects 

As noted above, GATE is not a system for doing LE, but a backplane to 
assemble processing modules and databases to form LE systems (whether ex-
perimental or for end-user delivery). The analogy here is with extensible com-
puter hardware architectures - expansion cards in a PC, for example. Just as
producing a card to do fast video off a VESA bus or to drive a serial line from 
an ISA slot means conforming to the protocols defined by those architectures, 
so integrating LE objects in GATE (i.e. producing a new member of CRE- 
OLE) imposes some interface constraints. These constraints are in the form of 
functions that must be available for GDM and GGI, and are described here. 

When the user initiates a particular CREOLE object via GGI (or when a pro- 
grammer does the same via the GATE API when building an LE application) 
the object is initialised using the standard calls provided in the CREOLE 
wrapper. The object then runs, obtaining the information it needs (document 
source, annotations from other objects) via calls to the GDM API. Its results 
are then stored in the GDM database and become available for examination 
via GGI or to be the input to other CREOLE objects. 
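
The cycle just described (initialise, run, read from GDM, store results in
GDM) is sketched below in Python. This is purely illustrative: the GATE API
is a Tcl API, and every name here (DocumentManager, WhitespaceTokeniser,
TokenCounter, init, run and the annotation types "token" and "stats") is
hypothetical rather than taken from GDM or CREOLE.

```python
import re


class DocumentManager:
    """In-memory stand-in for GDM: holds source text and annotations."""

    def __init__(self):
        self._sources = {}
        self._annotations = {}

    def add_document(self, doc_id, source):
        self._sources[doc_id] = source
        self._annotations[doc_id] = []

    def get_source(self, doc_id):
        return self._sources[doc_id]

    def get_annotations(self, doc_id, kind):
        return [a for a in self._annotations[doc_id] if a[0] == kind]

    def add_annotation(self, doc_id, kind, start, end, **attributes):
        self._annotations[doc_id].append((kind, start, end, attributes))


class WhitespaceTokeniser:
    """Toy module: reads the document source, stores token spans."""

    def init(self):
        self.ready = True

    def run(self, manager, doc_id):
        for match in re.finditer(r"\S+", manager.get_source(doc_id)):
            manager.add_annotation(doc_id, "token", match.start(), match.end())


class TokenCounter:
    """Toy module: consumes another module's annotations via the manager."""

    def init(self):
        self.ready = True

    def run(self, manager, doc_id):
        tokens = manager.get_annotations(doc_id, "token")
        length = len(manager.get_source(doc_id))
        manager.add_annotation(doc_id, "stats", 0, length,
                               token_count=len(tokens))


gdm = DocumentManager()
gdm.add_document("d1", "Dog bites man.")
for module in (WhitespaceTokeniser(), TokenCounter()):
    module.init()              # standard initialisation call
    module.run(gdm, "d1")      # results are written back to the manager
token_spans = [(s, e) for _, s, e, _ in gdm.get_annotations("d1", "token")]
```

Note that the two modules never call each other: all communication passes
through the document manager, which is what allows one module to be swapped
for another that produces the same annotations.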

Figure 5 shows the two ways to provide the CREOLE wrapper functions.
Packages written in C or in languages which can be used as libraries with
C linkage conventions can be compiled into GATE directly as a Tcl package
(see Ousterhout 1994 chapter 31). This is tight coupling (route 2 in the di-
agram). Alternatively the underlying implementation of services can be via
an executable (loose coupling, route 1). This executable is then called by
the CREOLE wrapper code. In either case the implementation of CREOLE
services is completely transparent to GATE.

CREOLE wrappers encapsulate information about the preconditions for a 
module to run (data that must be present in the GDM database) and post- 
conditions (data that will result). This information is needed by GGI - see 
below. Note that aside from the information needed for GGI to provide access 
to a module, GATE compatibility equals TIPSTER compatibility - i.e. there
will be very little overhead in making any TIPSTER module run in GATE.

[Figure 5: CREOLE integration routes. Diagram labels: Route 1 (loose coupling);
Route 2 (tight coupling); executable; C[++] library; calls; results / byte
offsets; Tcl wrapper code; GATE Tcl API: GDM & GGI services; TDM DBMS;
GGI Tcl/Tk package; CREOLE; GGI/GDM.]



In addition to the macro requirements on CREOLE integration described 
above, GDM imposes constraints on the I/O format of CREOLE objects, 
namely that all information must be associated with byte offsets and conform 
to the annotations model of the TIPSTER architecture (see appendix A). The
principal overhead in this process is making the components being integrated 
use byte offsets, if they don't already do so. Where components use SGML, 
I/O filters will convert markup to the TIPSTER style. 
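
An I/O filter of the kind just mentioned could work along the following
lines. This Python sketch is an assumption-laden illustration, not the GATE
filter: the function name strip_markup is invented, only attribute-free and
properly nested tags are recognised, and the text is assumed to be ASCII so
that offsets can be counted in bytes.

```python
import re

# Matches simple start and end tags without attributes, e.g. <SENT>, </SENT>.
TAG = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)>")


def strip_markup(marked_up):
    """Split inline markup into plain text plus (tag, start, end) spans."""
    plain = []          # pieces of text with the markup removed
    length = 0          # number of plain-text bytes emitted so far
    open_tags = []      # stack of (tag, start offset)
    spans = []
    position = 0
    for match in TAG.finditer(marked_up):
        piece = marked_up[position:match.start()]
        plain.append(piece)
        length += len(piece.encode("ascii"))
        closing, tag = match.group(1), match.group(2)
        if closing:
            if not open_tags or open_tags[-1][0] != tag:
                raise ValueError(f"unexpected </{tag}>")
            _, start = open_tags.pop()
            spans.append((tag, start, length))
        else:
            open_tags.append((tag, length))
        position = match.end()
    plain.append(marked_up[position:])
    if open_tags:
        raise ValueError("unclosed tag")
    return "".join(plain), spans


text, spans = strip_markup(
    "<DOC><SENT>Dog bites man.</SENT><SENT>Newshound implicated.</SENT></DOC>"
)
```

The resulting spans refer to the text with markup removed, so the source can
be stored once and the markup kept as annotations alongside it.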

As we noted above CREOLE objects may be data-orientated. It is our inten- 
tion to integrate as large a set of LE data resources as possible within GATE 
in order to reduce the overhead of installing and understanding the software 
interfaces of these resources. For example, the WordNet thesaurus (Miller,
Beckwith, Fellbaum, Gross, Miller 1993) will be given a CREOLE wrapper
encapsulating the C API as a GATE service. Grammars, lexica, gazetteers -
all are candidates for CREOLE integration, and as the set expands GATE can
become a standard resource repository for LE data as well as LE processing
modules. 

GGI 

GGI is a graphical tool that encapsulates the GDM and CREOLE resources 
in a fashion suitable for interactive building and testing of LE components 
and systems. The philosophy is to provide a rich set of tools including but 
not limited to the CREOLE modules. So, for example, access to a KWIC tool 
or the WordNet interface is included, as well as taggers, parsers, etc. from 
CREOLE. 

GGI is intended for developers. Delivered systems built on GATE will not 
generally use GGI (though they may be able to reuse parts of the interface for 
their own front-ends). 

GGI adopts the OSF Motif look and feel (as used, for example, in Netscape),
provided via the Tcl/Tk toolkit.


The current version of the interface has gone through several redesign
iterations based on feedback on initial prototypes.

Launching CREOLE processes is done via a partially connected graph of
possible paths through the processes embodied by the systems and modules menus
of 0.1.^13 The idea is that each module that is applicable to the LE task
under development (IE, MT, ...) is given a button in a large canvas window.
Clicking the button will run the process associated with the button. Figure 6
shows a small example. Here we have a choice of whether to run the Brill or
POST taggers, both of which may produce results required by the BUChart
parser, or the Xerox tagger which will not produce results appropriate to the
parser.

^12 Tcl/Tk will be available in native look and feel for PC/Windows and
Macintosh some time in 1996, so GATE may at that point be able to migrate to
these platforms.




Figure 6: GGI objects graph example 



Interestingly, this arrangement for viewing and launching chains of LE modules
quite closely parallels the braided evaluation model proposed in (Crouch,
Gaizauskas, Netter 1995). This suggests that implementation of the evaluation
schemes discussed there and in (Galliers, Sparck Jones 1993; Sparck Jones
1994) might be facilitated by GATE.

^13 Thanks particularly to Kevin Humphreys for this idea.
The first task facing a user is to open a collection. It is not mandatory to then
open a specific document - functions may be run on whole collections as well 
as single documents (though it may be appropriate to have a warning dialog 
box as large batched runs "may take a little while"). There is a file menu 
in the top left corner, with open collection leading to a list of collections, 
followed by a list of documents. 

This is a view of the system as a set of processes that may be linked in pipelines 
in various ways. The permissible paths through the graph depend on the data 
that a module requires as its input. GATE systems are built from combinations 
of modules chained together. These chains may be represented by highlighted 
paths through the graph (e.g. selecting LaSIE from a systems menu would
highlight the arcs connecting the various LaSIE object nodes). Clicking a
module within a chain will then run all those in the remaining portion of the 
chain. 
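
As a rough sketch of the chaining behaviour just described (in Python, and not GATE's own code): each module is modelled by the result types it requires and produces, and clicking a module runs it together with the rest of its chain. The `Module` record, the module names and the result names are hypothetical, and treating the clicked module as part of the "remaining portion" is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Module:
    name: str
    requires: frozenset   # result types that must already be present
    produces: frozenset   # result types added by a successful run


def run_from(chain, clicked, present):
    """Run `clicked` and every later module in `chain`, in order.

    `present` is the set of result types already stored for the current
    document or collection. Returns the names of the modules that ran
    and the updated set of result types.
    """
    names = [m.name for m in chain]
    start = names.index(clicked)        # ValueError if not in the chain
    present = set(present)
    ran = []
    for module in chain[start:]:
        missing = module.requires - present
        if missing:
            raise RuntimeError(
                f"{module.name} lacks input: {sorted(missing)}")
        present |= module.produces      # stand-in for running the module
        ran.append(module.name)
    return ran, present


# Hypothetical three-stage chain.
chain = [
    Module("tokeniser", frozenset(), frozenset({"tokens"})),
    Module("tagger", frozenset({"tokens"}), frozenset({"pos_tags"})),
    Module("parser", frozenset({"pos_tags"}), frozenset({"parse"})),
]
ran, present = run_from(chain, "tagger", {"tokens"})
```

Checking `requires` before each step is what makes some paths through the graph permissible and others not: starting at the parser with no tags stored fails immediately rather than running on missing input.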

Visual information regarding the data present in the system is represented in 
two ways: by colour-coding of the module buttons and by colour-coding of the 
result (with an associated colour key). 

The module buttons change colour depending on whether the data that the 
process produces is present or not: 

• green: ready to run, data not present;

• amber: requires data from a previous stage to run;

• red: data available (process has already been run successfully on the
current document / collection).

Clicking on a green button runs the relevant CREOLE object (via a pop-up 
dialog if options need setting) ; clicking on amber generates a menu of possible 
preceding modules to run; click red and a menu of results to view is displayed. 
An optional pop-up launched from the file menu displays the output of modules
as they run. 
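
The colour scheme amounts to a small decision rule over the data currently stored. The sketch below, in Python, is one way to express it and is not taken from GGI itself: the function name, the set-based representation and the `parse` result name are assumptions, while the colours, the click behaviour and the Brill/BUChart pairing follow the description above.

```python
def button_colour(requires, produces, present):
    """Colour of a module button, following the key given above.

    requires / produces: sets of result types the module needs / creates.
    present: set of result types already stored for the current document.
    Assumes every module produces at least one result type.
    """
    if produces <= present:
        return "red"      # data available: the module has already run
    if requires <= present:
        return "green"    # ready to run, its own data not yet present
    return "amber"        # needs data from a previous stage


# What a click does for each colour, as described in the text.
CLICK_ACTION = {
    "green": "run the CREOLE object",
    "amber": "offer a menu of preceding modules to run",
    "red": "offer a menu of results to view",
}

brill = (set(), {"pos_tags"})          # (requires, produces)
buchart = ({"pos_tags"}, {"parse"})

before = [button_colour(*m, present=set()) for m in (brill, buchart)]
after = [button_colour(*m, present={"pos_tags"}) for m in (brill, buchart)]
```

Before any module has run, the tagger button is green and the parser button amber; once the tagger's output is stored, the tagger turns red and the parser turns green.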

Given that the set of CREOLE modules will be large it will be necessary to 
allow a large screen space for the graph, and for it to be X and Y scrollable. 
Perhaps processing stages should be collapsible. Note that the implementation 
of the display will be non-trivial, and will require the use of some drawing 
algorithm like that used in da Vinci (Frolich, Werner 1994). 

Further information regarding the data produced by CREOLE objects is de- 
livered by colour-coded displays of documents, e.g. a text might be displayed 
with coreference chains displayed in green. Each type of result also specifies a 
colour key, to be displayed on a bar with the result viewer. 

The implementation of the processes graph should be via configuration
information supplied with each CREOLE wrapper - i.e. there should be no
information hard-coded into GATE regarding different modules. This might
be achieved for example by each object registering its name, version and
result type.^14 It should also be possible to specify standard ways for results
to be displayed (via an annotation type/colour key table, for example). These
details need more work.

^14 Each object then also specifies a set of preconditions in the form of regular
expressions matching these annotations, e.g. the Brill tagger might store

• brill-0.1 pos_tags

and a parser that required the tags to run might then specify

• brill-* pos_tags, or

• * pos_tags, or

• (brill)|(post)-* pos_tags.
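
A minimal sketch of this matching scheme, in Python, assuming each stored result is registered as a "name-version result_type" string. The `satisfied` function is hypothetical, and the patterns are written in Python `re` syntax (`.*` rather than a bare `*`, and an explicit group for the alternation), which is our translation of the shorthand used above rather than a notation GATE defines.

```python
import re


def satisfied(precondition, registered):
    """True if any registered 'name-version result_type' entry matches.

    precondition: a regular expression in Python `re` syntax.
    registered: iterable of strings stored by modules that have run.
    """
    pattern = re.compile(precondition)
    return any(pattern.fullmatch(entry) for entry in registered)


# What the Brill tagger might store after a successful run.
registered = ["brill-0.1 pos_tags"]

checks = {
    r"brill-.* pos_tags": satisfied(r"brill-.* pos_tags", registered),
    r".* pos_tags": satisfied(r".* pos_tags", registered),
    r"(brill|post)-.* pos_tags":
        satisfied(r"(brill|post)-.* pos_tags", registered),
    r"xerox-.* pos_tags": satisfied(r"xerox-.* pos_tags", registered),
}
```

Because the preconditions are data supplied by each wrapper, a new tagger can satisfy an existing parser simply by registering a matching result name; nothing about either module needs to be hard-coded.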
Implementation technology



GATE is implemented in a mixture of Tcl/Tk and C[++]. The glue between
the various components is Tcl, a script language developed specifically
for systems integration (Ousterhout 1994). In common with other script
languages, like Perl or the Bourne shell plus UNIX utilities, Tcl provides
high-level constructs and facilities that greatly simplify the implementation
of simple systems. Unlike other script languages, Tcl also has an extremely
clean C interface, allowing seamless integration of C[++] libraries with Tcl
scripts. GATE's own code, then, is Tcl or C[++] (though this in no way restricts
the implementation technology used in CREOLE modules - see above).

Another reason for choosing Tcl is the Tk package that comes with it. Tk
is a Tcl library that encapsulates the MOTIF X-Windows toolkit. Whereas
programming X via C is a black art that has spawned legions of expensive and 
complex screen-painting utilities, scripting Tk is a simple interactive process. 
The initial GGI prototype was coded in less than a week by a novice Tk 
programmer. 

Tcl/Tk are public domain and are under active development by Sun
Microsystems. Forthcoming changes include cross-platform portability across
UNIX/X, MS-Windows and Macintosh.

The GDM API, then, is a set of Tcl calls. These calls are generally also
available in C[++].

All GATE systems are 8-bit clean, and may therefore be used with languages
that can be represented by 1-byte character sets. Multi-byte character support
is highly desirable (probably via the UNICODE standard). A route to 16-bit
capability might be via a replacement for the Tcl string functions (maybe
using the Tools.h++ library (Keffer 1995)) and via reimplementation of the
Tk text widget (or integration of CRL's Motif widget).
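
The reason multi-byte support touches the string functions and the text widget can be seen in a small example. It is a Python illustration only: the sample text and the choice of UTF-8 as the multi-byte encoding are assumptions (the report names only UNICODE). With a 1-byte character set, a character position and a byte offset coincide; once some characters occupy more than one byte, they no longer do.

```python
text = "café au lait"
char_start = text.index("au")          # character offset of "au"

latin1 = text.encode("latin-1")        # one byte per character
utf8 = text.encode("utf-8")            # "é" takes two bytes here

# Slice both byte strings using the character offset.
latin1_span = latin1[char_start:char_start + 2]
utf8_span = utf8[char_start:char_start + 2]

utf8_start = utf8.index(b"au")         # true byte offset in the UTF-8 data
```

In the 1-byte encoding the slice recovers "au"; in the multi-byte encoding the same numbers select the wrong bytes, so any component that equates character positions with byte offsets would need to change.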